Lexicon Induction for Spoken Rusyn - Challenges and Results
Abstract
This paper reports on challenges and results in developing NLP resources for spoken Rusyn. As a Slavic minority language, Rusyn has no existing resources to draw on. We propose to build a morphosyntactic dictionary for Rusyn by combining existing resources from the etymologically close Slavic languages Russian, Ukrainian, Slovak, and Polish. We adapt these resources to Rusyn using vowel-sensitive Levenshtein distance, hand-written language-specific transformation rules, and combinations of the two. Compared to an exact-match baseline, we increase the coverage of the resulting morphological dictionary by up to 77.4% relative (42.9% absolute), which raises tagging recall by 11.6% relative (9.1% absolute). Our research confirms and extends the results of previous studies showing the effectiveness of using NLP resources from neighboring languages for low-resource languages.
Similar resources
Multi-source morphosyntactic tagging for spoken Rusyn
This paper deals with the development of morphosyntactic taggers for spoken varieties of the Slavic minority language Rusyn. As neither annotated corpora nor parallel corpora are electronically available for Rusyn, we propose to combine existing resources from the etymologically close Slavic languages Russian, Ukrainian, Slovak, and Polish and adapt them to Rusyn. Using MarMoT as tagging toolki...
Modality-independent Effects of Phonological Neighborhood Structure on Initial L2 Sign Language Learning
The goal of the present study was to characterize how neighborhood structure in sign language influences lexical sign acquisition in order to extend our understanding of how the lexicon influences lexical acquisition in both sign and spoken languages. A referent-matching lexical sign learning paradigm was administered to a group of 29 hearing sign language learners in order to create a sign lexi...
Increasing The Coverage Of A Domain Independent Dialogue Lexicon With VERBNET
This paper investigates how to extend coverage of a domain independent lexicon tailored for natural language understanding. We introduce two algorithms for adding lexical entries from VERBNET to the lexicon of the TRIPS spoken dialogue system. We report results on the efficiency of the method, discussing in particular precision versus coverage issues and implications for mapping to other lexica...
The Trilingual ALLEGRA Corpus: Presentation and Possible Use for Lexicon Induction
In this paper, we present a trilingual parallel corpus for German, Italian and Romansh, a Swiss minority language spoken in the canton of Grisons. The corpus called ALLEGRA contains press releases automatically gathered from the website of the cantonal administration of Grisons. Texts have been preprocessed and aligned with a current state-of-the-art sentence aligner. The corpus is one of the f...
A robust fusion method for multilingual spoken document retrieval systems employing tiered resources
In this study, we present two novel fusion approaches to merge subword and word based retrieval methods within a multilingual spoken document retrieval (SDR) system. Considering the fact that more than 6000 languages are spoken in the world today, resources (e.g., text and audio data, pronunciation lexicon) needed to develop Automatic Speech Recognition (ASR) systems for such a range of languag...
Publication date: 2017